[Core] [Bugfix] Refactor block manager subsystem for better testability #3492
Conversation
@simon-mo ready for review
    # Now that the batch has been created, we can assume all blocks in the
    # batch will have been computed before the next scheduling invocation.
    for seq_group in scheduler_outputs.scheduled_seq_groups:
This assumes the model execution won't fail and there's no retry. Is that the general assumption in the engine now?
yep
Add this assumption to the comments? What happens to later preemption?
comment added

> What happens to later preemption?

These lines do not change prefix caching behavior w.r.t. later preemption (either case requires recomputing the blocks).
oh btw @rkooo567 this assumption is not introduced in this PR; the scheduler already assumes that its schedule is implemented by the engine. retries and failures are not yet in scope of vllm
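The assumption under discussion can be illustrated with a toy sketch (hypothetical names, not vLLM's actual API): once the scheduler forms a batch, every block that batch touches is marked computed before the next scheduling invocation, with no retry or failure path.

```python
# Toy sketch of the scheduling assumption under discussion: once a batch
# is formed, all of its blocks are treated as computed before the next
# scheduling invocation (no retry/failure path is modeled).

class ToyBlockManager:
    def __init__(self):
        self.computed_block_ids = set()

    def mark_batch_computed(self, scheduled_seq_groups):
        # Mark every block of every scheduled sequence group as computed.
        for seq_group in scheduled_seq_groups:
            for block_id in seq_group["block_ids"]:
                self.computed_block_ids.add(block_id)


manager = ToyBlockManager()
batch = [{"block_ids": [0, 1]}, {"block_ids": [1, 2]}]
manager.mark_batch_computed(batch)
assert sorted(manager.computed_block_ids) == [0, 1, 2]
```

If model execution could fail and be retried, this eager marking would over-report which blocks hold valid computed state, which is why the question above matters.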
Thanks for the great work. The new structure is looking great. I don't have major issues, so I'm approving. I left some comments, mostly due to my own confusion when reading the code, which I imagine others will run into as well.
tests/core/block/e2e/conftest.py (outdated)
    def generator_outer():
        for llm in generator_inner():
            yield llm
            del llm
Why do we need another level of wrapper? Would it work without it? If not, please comment why.
I see the usage of the llm generator below, but I'm still confused, since we are only yielding one llm instance.
oh good catch, not necessary
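To see why the wrapper is redundant, here is a self-contained sketch (the `"llm-instance"` placeholder stands in for a real LLM object): `generator_outer` only re-yields what `generator_inner` produces, so the two are equivalent.

```python
# Sketch of the pattern from the diff: generator_outer only re-yields
# what generator_inner produces, so the extra wrapper adds nothing.

def generator_inner():
    yield "llm-instance"  # placeholder for a real LLM object

def generator_outer():
    for llm in generator_inner():
        yield llm
        del llm  # drops only the local reference; the consumer may still hold one

# The wrapper is observably equivalent to using generator_inner() directly:
assert list(generator_outer()) == list(generator_inner())
```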
vllm/core/block/interfaces.py (outdated)
    pass


    class DeviceAwareBlockAllocator(ABC):
I do wonder whether this can "inherit" BlockAllocator somehow, so we are only re-defining allocate_mutable and allocate_immutable.
good idea
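A minimal sketch of the suggested inheritance (hypothetical signatures, not the actual vLLM interfaces): the device-aware interface subclasses `BlockAllocator` and re-declares only the two allocation methods with an extra device parameter, inheriting everything else.

```python
# Hypothetical sketch of the inheritance suggested above; signatures are
# illustrative, not vLLM's real interfaces.
from abc import ABC, abstractmethod
from enum import Enum
from typing import List


class Device(Enum):
    GPU = 0
    CPU = 1


class BlockAllocator(ABC):
    @abstractmethod
    def allocate_mutable(self):
        ...

    @abstractmethod
    def allocate_immutable(self, token_ids: List[int]):
        ...


class DeviceAwareBlockAllocator(BlockAllocator):
    # Only the allocation methods gain a device argument; any other
    # abstract methods would be inherited from BlockAllocator unchanged.
    @abstractmethod
    def allocate_mutable(self, device: Device):
        ...

    @abstractmethod
    def allocate_immutable(self, token_ids: List[int], device: Device):
        ...


assert issubclass(DeviceAwareBlockAllocator, BlockAllocator)
```

One caveat with this shape: the overridden methods change arity, so the subclass is not substitutable for the base in the Liskov sense; that trade-off is why keeping them as separate ABCs is also defensible.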
    @@ -196,6 +196,8 @@ def lora_int_id(self) -> int:
        return self.lora_request.lora_int_id if self.lora_request else 0

    def hash_of_block(self, logical_idx: int) -> int:
        # TODO This can produce incorrect hash when block size > prompt size
😨
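The TODO above can be demonstrated with a toy content hash (illustrative only, not vLLM's actual hashing): when the prompt is shorter than the block size, the block is not yet full, so hashing it produces a value that changes once the sequence grows into the rest of the block.

```python
# Toy demonstration of the TODO: hashing a logical block that is not yet
# full yields a hash that is wrong as a content hash of the full block.
BLOCK_SIZE = 4

def toy_hash_of_block(token_ids, logical_idx):
    # Hash everything up to the end of block `logical_idx`. If the
    # sequence hasn't filled that block yet, this hashes partial content.
    end = (logical_idx + 1) * BLOCK_SIZE
    return hash(tuple(token_ids[:end]))

# Prompt shorter than the block size: block 0 is only partially filled,
# yet it already receives a "full block" hash.
short_prompt = [1, 2]
h_partial = toy_hash_of_block(short_prompt, 0)

# Once the sequence grows, the same logical block hashes differently,
# so the earlier value was not a stable content hash.
grown = [1, 2, 3, 4]
h_full = toy_hash_of_block(grown, 0)
assert h_partial != h_full
```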
vllm/core/block_manager_v2.py (outdated)
    def access_all_blocks_in_seq(self, seq, now):
        pass
What's the TODO here?
added comment (basically #3667)
    def mark_blocks_as_computed(self, seq_group: SequenceGroup):
        # We ignore the sequence group as it's not necessary. After the batch is
        # formed by the scheduler, we do not need to mark blocks from individual
        # sequence groups as computed -- all blocks in the batch can be marked
        # as computed.
is this a no-op then?
kinda -- the plumbing is here to validate the design, but prefix caching isn't tested e2e (need #3667)
    caching.

    Args:
        create_block (Block.Factory): A factory function for creating new
Can we parameterize the type somehow, so we constrain it to NaiveBlock? It only works there.
hm, open to more typing, but here we need it to return an unspecialized block. the prefix caching allocator will specify to the naive block allocator that it should construct prefix caching blocks instead of the default naive block.
I will add a comment
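A sketch of the factory indirection being described (toy types, hypothetical names): the naive allocator stays generic over a `create_block` callable, and the prefix-caching layer passes its own factory so the naive allocator constructs prefix-caching blocks instead of the default.

```python
# Toy sketch of the create_block factory indirection described above.
from typing import Callable, List, Optional


class Block:
    def __init__(self, token_ids: List[int]):
        self.token_ids = token_ids


class PrefixCachingBlock(Block):
    def __init__(self, token_ids: List[int]):
        super().__init__(token_ids)
        self.content_hash: Optional[int] = None


# The factory type deliberately returns the unspecialized Block,
# which is why constraining it to a single concrete type is awkward.
BlockFactory = Callable[[List[int]], Block]


class ToyNaiveAllocator:
    def __init__(self, create_block: BlockFactory = Block):
        self._create_block = create_block

    def allocate(self, token_ids: List[int]) -> Block:
        # The allocator builds whatever the factory returns.
        return self._create_block(token_ids)


naive = ToyNaiveAllocator()
assert type(naive.allocate([1, 2])) is Block

# The prefix caching layer reuses the naive allocator with its own factory:
caching = ToyNaiveAllocator(create_block=PrefixCachingBlock)
assert type(caching.allocate([1, 2])) is PrefixCachingBlock
```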
    def free(self, block: Block) -> None:
        block_id = block.block_id
        block.block_id = None
Why is this needed? Please add a comment.
added
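A sketch of why clearing `block_id` on free helps (toy classes, not the real implementation): any later use of the freed block sees `None` and fails fast, instead of silently reading an id that may since have been handed to another sequence.

```python
# Toy sketch: clearing block_id on free turns use-after-free into a
# loud failure instead of silent aliasing of a reallocated block.
class Block:
    def __init__(self, block_id: int):
        self.block_id = block_id


class ToyAllocator:
    def __init__(self):
        self.free_ids = []

    def free(self, block: Block) -> None:
        block_id = block.block_id
        # Clear the id so any later use of this freed Block object fails
        # fast, rather than aliasing a physical block that may have been
        # reallocated to a different sequence.
        block.block_id = None
        self.free_ids.append(block_id)


allocator = ToyAllocator()
block = Block(block_id=7)
allocator.free(block)
assert block.block_id is None     # stale handle is now visibly invalid
assert allocator.free_ids == [7]  # the physical id returns to the pool
```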
    Refcount = int


    class NaiveBlockAllocator(BlockAllocator):
I think this should be called CoWBlockAllocator btw because it's actually quite powerful
I thought about this -- the conflict is that the prefix caching allocator also supports CoW (it isn't unique to this allocator).
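A minimal sketch of the refcount-based copy-on-write both allocators share (hypothetical names and structure): forking a sequence bumps a block's refcount, and appending to a block with more than one owner forces a copy into a fresh block first.

```python
# Toy sketch of refcount-based copy-on-write, as shared by both the
# naive and the prefix caching allocators. Names are illustrative.
class ToyCowAllocator:
    def __init__(self, num_blocks: int):
        self.refcounts = {i: 0 for i in range(num_blocks)}

    def allocate(self) -> int:
        for block_id, rc in self.refcounts.items():
            if rc == 0:
                self.refcounts[block_id] = 1
                return block_id
        raise RuntimeError("out of blocks")

    def fork(self, block_id: int) -> None:
        # Sharing a block just bumps its refcount.
        self.refcounts[block_id] += 1

    def cow_block_if_not_appendable(self, block_id: int) -> int:
        # A block may be appended in place only if it has a single owner.
        if self.refcounts[block_id] == 1:
            return block_id
        # Otherwise: drop our reference and copy into a fresh block.
        self.refcounts[block_id] -= 1
        return self.allocate()


alloc = ToyCowAllocator(num_blocks=4)
src = alloc.allocate()
alloc.fork(src)                      # two sequences now share `src`
dst = alloc.cow_block_if_not_appendable(src)
assert dst != src                    # the write triggered a copy
assert alloc.refcounts[src] == 1     # the other sequence keeps `src`
```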
Thanks for the review 🙏. Applying feedback.
This PR refactors the block manager / allocator / prefix caching subsystem to make things easier to test. Concretely, it establishes the following layers (diagrams in https://docs.google.com/document/d/1ipAypZYfZgloP_08sLi1Z2BADHMauWScbT_H1F0ThCQ/edit?pli=1):
The interfaces are such that the BlockAllocator can be implemented via a NaiveBlockAllocator (e.g. no prefix caching) or a PrefixCachingBlockAllocator. The value of this approach is in its separation of concerns and consequent improved testability.
This PR is missing the following features before the BlockManagerV2 can become the default:

- computed bit

Testing
Design
Key APIs
Important interactions
Swapping GPU/CPU
Copy-on-write
Prefix caching promotion